library(fivethirtyeight)
data(drinks)
# or download directly
alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")skim(alcohol_direct)| Name | alcohol_direct |
| Number of rows | 193 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 28 | 0 | 193 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beer_servings | 0 | 1 | 106.16 | 101.14 | 0 | 20.0 | 76.0 | 188.0 | 376.0 | ▇▃▂▂▁ |
| spirit_servings | 0 | 1 | 80.99 | 88.28 | 0 | 4.0 | 56.0 | 128.0 | 438.0 | ▇▃▂▁▁ |
| wine_servings | 0 | 1 | 49.45 | 79.70 | 0 | 1.0 | 8.0 | 59.0 | 370.0 | ▇▁▁▁▁ |
| total_litres_of_pure_alcohol | 0 | 1 | 4.72 | 3.77 | 0 | 1.3 | 4.2 | 7.2 | 14.4 | ▇▃▅▃▁ |
As the report on n_missing in the above output shows, there don’t seem to be any missing values for any variable in the data set. Moreover, we can observe that the only non-numeric variable is country, while the four remaining are numeric.
# YOUR CODE GOES HERE
beer_consumtion <- alcohol_direct %>%
slice_max(order_by = beer_servings, n = 25)
ggplot(beer_consumtion, aes(y=fct_reorder(country,beer_servings), x=beer_servings))+
geom_col(fill = "orange")+
theme_bw( )+
labs(title = "Beer Consumption", subtitle = "The top beer consuming countries", x="Annual Servings", y="Country")wine_consumption <- alcohol_direct %>%
slice_max(order_by = wine_servings, n = 25)
ggplot(wine_consumption, aes(y=fct_reorder(country,wine_servings), x=wine_servings))+
geom_col(fill = "darkred")+
theme_bw( )+
labs(title = "Wine Consumption", subtitle = "The top wine consuming countries", x="Annual Servings", y="Country")# YOUR CODE GOES HERE
spirit_consumtion <- alcohol_direct %>%
slice_max(order_by = spirit_servings, n = 25)
ggplot(spirit_consumtion, aes(y=fct_reorder(country,spirit_servings), x=spirit_servings))+
geom_col(fill = "lightblue")+
theme_bw( )+
labs(title = "Spirit Consumption", subtitle = "The top spirit consuming countries", x="Annual Servings", y="Country")In general, there tends to be no overlap among countries with regards to the top spots in alcohol consumption, i.e. the top beer consuming countries are not the top wine consuming countries. It is surprising that Germany is only ranked #4 in the top beer consuming countries, after Namibia, Czech Republic and Gabon. A quick research on the reason for Namibia’s high beer consumption hint at its former colonial ties with Germany (see here. Also, one can observe that the top wine consuming countries are almost exclusively European, which might be due to the reason that wine is culturally rooted in these countries and many of them are also known for their production of wine.
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
skim(movies)| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
There are no missing values as can be observed when analysing n_missing, however, there are duplicate values for some variables. What is especially interesting, is that there are duplicate Titles for movies. This can be observed through looking at n_unique: Even though there are a total of 2961 records, there only seem to be 2907 unique movie titles.
count_moviesByGenre<-movies%>%
group_by(genre)%>%
count(sort=TRUE)%>%
rename("number_of_movies" = n)
count_moviesByGenre## # A tibble: 17 × 2
## # Groups: genre [17]
## genre number_of_movies
## <chr> <int>
## 1 Comedy 848
## 2 Action 738
## 3 Drama 498
## 4 Adventure 288
## 5 Crime 202
## 6 Biography 135
## 7 Horror 131
## 8 Animation 35
## 9 Fantasy 28
## 10 Documentary 25
## 11 Mystery 16
## 12 Sci-Fi 7
## 13 Family 3
## 14 Musical 2
## 15 Romance 2
## 16 Western 2
## 17 Thriller 1
There seems to be quite a large difference in the number of records across genres in this dataset, as there is information on 848 Comedy movies opposed to only 1 on Thrillers.
gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Rank genres by this return_on_budget in descending order.avgGenre <- movies%>%
group_by(genre)%>%
summarise(mean(gross), mean(budget))%>%
rename("Average_Gross" = "mean(gross)", "Average_Budget" = "mean(budget)")%>%
mutate(return_on_budget = Average_Gross/Average_Budget)%>%
mutate(Average_Gross = dollar(Average_Gross), Average_Budget = dollar(Average_Budget))%>% # The dollar function is from the scales package and allows us to make the numbers more easily readable.
mutate(return_on_budget = round(return_on_budget, 3))%>% #We round the return column to the 3rd nearest decimal.
arrange(desc(return_on_budget))
avgGenre## # A tibble: 17 × 4
## genre Average_Gross Average_Budget return_on_budget
## <chr> <chr> <chr> <dbl>
## 1 Musical $92,084,000 $3,189,500 28.9
## 2 Family $149,160,478 $14,833,333 10.1
## 3 Western $20,821,884 $3,465,000 6.01
## 4 Documentary $17,353,973 $5,887,852 2.95
## 5 Horror $37,713,738 $13,504,916 2.79
## 6 Fantasy $42,408,841 $17,582,143 2.41
## 7 Comedy $42,630,552 $24,446,319 1.74
## 8 Mystery $67,533,021 $39,218,750 1.72
## 9 Animation $98,433,792 $61,701,429 1.60
## 10 Biography $45,201,805 $28,543,696 1.58
## 11 Adventure $95,794,257 $66,290,069 1.44
## 12 Drama $37,465,371 $26,242,933 1.43
## 13 Crime $37,502,397 $26,596,169 1.41
## 14 Romance $31,264,848 $25,107,500 1.25
## 15 Action $86,583,860 $71,354,888 1.21
## 16 Sci-Fi $29,788,371 $27,607,143 1.08
## 17 Thriller $2,468 $300,000 0.008
topDirectors <- movies%>%
group_by(director)%>%
summarise(total_gross = sum(gross), avg_gross = mean(gross), median_gross = median(gross), sd_gross = sd(gross))%>%
slice_max(order_by = total_gross, n = 15)%>%
mutate(total_gross = dollar(total_gross), avg_gross = dollar(avg_gross), median_gross = dollar(median_gross), sd_gross = dollar(sd_gross))
topDirectors## # A tibble: 15 × 5
## director total_gross avg_gross median_gross sd_gross
## <chr> <chr> <chr> <chr> <chr>
## 1 Steven Spielberg $4,014,061,704 $174,524,422 $164,435,221 $101,421,051
## 2 Michael Bay $2,231,242,537 $171,634,041 $138,396,624 $127,161,579
## 3 Tim Burton $2,071,275,480 $129,454,718 $76,519,172 $108,726,924
## 4 Sam Raimi $2,014,600,898 $201,460,090 $234,903,076 $162,126,632
## 5 James Cameron $1,909,725,910 $318,287,652 $175,562,880 $309,171,337
## 6 Christopher Nolan $1,813,227,576 $226,653,447 $196,667,606 $187,224,133
## 7 George Lucas $1,741,418,480 $348,283,696 $380,262,555 $146,193,880
## 8 Robert Zemeckis $1,619,309,108 $124,562,239 $100,853,835 $91,300,279
## 9 Clint Eastwood $1,378,321,100 $72,543,216 $46,700,000 $75,487,408
## 10 Francis Lawrence $1,358,501,971 $271,700,394 $281,666,058 $135,437,020
## 11 Ron Howard $1,335,988,092 $111,332,341 $101,587,923 $81,933,761
## 12 Gore Verbinski $1,329,600,995 $189,942,999 $123,207,194 $154,473,822
## 13 Andrew Adamson $1,137,446,920 $284,361,730 $279,680,930 $120,895,765
## 14 Shawn Levy $1,129,750,988 $102,704,635 $85,463,309 $65,484,773
## 15 Ridley Scott $1,128,857,598 $80,632,686 $47,775,715 $68,812,285
ratings <- movies%>%
group_by(genre)%>%
summarise(avg_rating = mean(rating), min_rating = min(rating), max_rating = max(rating), median_rating = median(rating), sd_rating = sd(rating))%>%
arrange(desc(avg_rating))
ratings## # A tibble: 17 × 6
## genre avg_rating min_rating max_rating median_rating sd_rating
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Biography 7.11 4.5 8.9 7.2 0.760
## 2 Crime 6.92 4.8 9.3 6.9 0.849
## 3 Mystery 6.86 4.6 8.5 6.9 0.882
## 4 Musical 6.75 6.3 7.2 6.75 0.636
## 5 Drama 6.73 2.1 8.8 6.8 0.917
## 6 Documentary 6.66 1.6 8.5 7.4 1.77
## 7 Sci-Fi 6.66 5 8.2 6.4 1.09
## 8 Animation 6.65 4.5 8 6.9 0.968
## 9 Romance 6.65 6.2 7.1 6.65 0.636
## 10 Adventure 6.51 2.3 8.6 6.6 1.09
## 11 Family 6.5 5.7 7.9 5.9 1.22
## 12 Action 6.23 2.1 9 6.3 1.03
## 13 Fantasy 6.15 4.3 7.9 6.45 0.959
## 14 Comedy 6.11 1.9 8.8 6.2 1.02
## 15 Horror 5.83 3.6 8.5 5.9 1.01
## 16 Western 5.7 4.1 7.3 5.7 2.26
## 17 Thriller 4.8 4.8 4.8 4.8 NA
overall<-ggplot(movies, aes(x=rating))+
geom_histogram(color="black", fill="white")+
geom_vline(aes(xintercept=mean(rating), color = "Average Rating"), linetype = "dashed")+
labs(title = "Rating distribution", subtitle = "A histogram on overall rating across genres", x = "Rating")+
scale_color_manual(name = "Legend", values = c("Average Rating" = "orange"))+ #We adjust the legend to display the proper label for the mean line
theme_bw()
overallrating_by_genre<-ggplot(movies, aes(x=rating))+
geom_histogram(color="black", fill="white")+
facet_wrap(vars(genre), scales = "free_y")+ #we decided that it would be more visually informative, if the y axis adapts from histogram to histogram.
labs(title = "Rating distribution", subtitle = "Histograms per genre")+
theme_bw()
rating_by_genregross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?all_scatter_likes <- ggplot(movies, aes(x=cast_facebook_likes, y=gross))+
geom_point(aes(color=genre))+
geom_smooth(method = "lm")+
scale_x_continuous(labels = number)+ #we use the scales package to enhance the axes' readability
scale_y_continuous(labels = dollar)+
labs(title = "Likes and gross revenue", subtitle = "An undadjusted look at the relationship between casts' facebook popularity and the movie's gross revenue", x = "Casts' likes on facebook", y = "Gross Revenue")+
theme_bw()
all_scatter_likes#In the following, we left out the visualization of outlying values in order to make the graph easier to read.
adjusted_scatter_likes <- ggplot(movies,aes(x=cast_facebook_likes, y=gross))+
geom_point(aes(color=genre))+
geom_smooth(method = "lm")+
xlim(0, 150000)+ #we limit the display of values on the x and y axes. This does not exclude the values from the regression line calculation, but only enhances readability.
ylim(0,500000000)+
labs(title = "Likes and gross revenue", subtitle = "An adjusted look at the relationship between casts' facebook popularity and the movie's gross revenue", x = "Casts' likes on facebook", y = "Gross Revenue")+
theme_bw()
adjusted_scatter_likesFrom the trendline and correlation coefficient of 0.28 one can observe that there is a weak positive correlation between the two variables indicating that the casts’ Facebook popularity may help predict the gross revenue of a movie to a slight extent.
gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.all_scatter_budget <- ggplot(movies, aes(x=budget, y=gross))+
geom_point(aes(color=genre))+
geom_smooth(method = "lm")+
scale_x_continuous(labels = dollar)+
scale_y_continuous(labels = dollar)+
labs(title = "Budget and gross revenue", subtitle = "A look at the relationship between the movie's budget and its gross revenue", x = "Budget", y = "Gross Revenue")+
theme_bw()
all_scatter_budgetCompared to the casts’ facebook popularity, budget seems to be a stronger predictor of a movie’s gross revenue as the correlation coefficient is closer to 1.
gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?all_scatter_rating <- ggplot(movies, aes(x=rating, y=gross))+
geom_point(aes(color=genre), size=0.5)+
ylim(0,750000000)+
geom_smooth(method = "lm")+
facet_wrap(vars(genre), scales = "free_y")+ #we decide to only have a changing y axis across the charts to make identifying smaller values easier
scale_y_continuous(labels = dollar_format(scale = 1/10000000, prefix = "m $"))+ #we use the scales package to scale the y axis to million $
labs(title = "Rating and gross revenue", subtitle = "A look at the relationship between a movie's rating and its gross revenue", x = "Rating", y = "Gross Revenue")+
theme_bw()
all_scatter_ratingOverall, one can observe a positive correlation between a movie’s rating and its gross revenue, however, as with casts’ facebook likes, the correlation is weak. Due to the small sample size of some genres (e.g. Musical, Western and Sci-Fi), not all trendlines are representative.
nyse <- read_csv(here::here("data","nyse.csv"))# YOUR CODE GOES HERE
sector_df<-nyse%>%
group_by(sector)%>%
summarise(number = count(sector))%>%
arrange(desc(number))
sector_df## # A tibble: 12 × 2
## sector number
## <chr> <int>
## 1 Finance 97
## 2 Consumer Services 79
## 3 Public Utilities 60
## 4 Capital Goods 45
## 5 Health Care 45
## 6 Energy 42
## 7 Technology 40
## 8 Basic Industries 39
## 9 Consumer Non-Durables 31
## 10 Miscellaneous 12
## 11 Transportation 10
## 12 Consumer Durables 8
barplot_sector<- ggplot(sector_df, aes(y=fct_reorder(sector, number), x=number))+
geom_col(fill = "darkblue", width = 0.5)+
geom_text(aes(label = number), hjust = 3, color = "white")+ #we display the # of companies and use hjust to move the labels to the left.
labs(title = "Sector differences", subtitle = "The number of companies per sector", y = "Sector", x= "Number of companies")+
theme_bw()
barplot_sectorSPY which is the SP500 ETF (Exchange Traded Fund).# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming,
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd
myStocks <- c("GREK","GME","AMC","AAPL","COW","TSLA","SPY" ) %>%
tq_get(get = "stock.prices",
from = "2011-01-01",
to = "2021-08-31") %>%
group_by(symbol)
glimpse(myStocks) # examine the structure of the resulting data frame## Rows: 16,018
## Columns: 8
## Groups: symbol [7]
## $ symbol <chr> "GREK", "GREK", "GREK", "GREK", "GREK", "GREK", "GREK", "GREK…
## $ date <date> 2011-12-08, 2011-12-09, 2011-12-12, 2011-12-13, 2011-12-14, …
## $ open <dbl> 44.8, 44.1, 44.4, 44.8, 41.9, 41.8, 42.0, 40.9, 42.3, 42.1, 4…
## $ high <dbl> 44.9, 44.7, 44.4, 44.8, 41.9, 42.0, 42.2, 43.4, 42.3, 42.1, 4…
## $ low <dbl> 44.2, 44.1, 42.5, 42.0, 40.8, 41.8, 39.9, 40.4, 41.4, 42.1, 4…
## $ close <dbl> 44.3, 44.4, 42.7, 42.3, 41.5, 42.0, 41.2, 41.0, 41.4, 42.1, 4…
## $ volume <dbl> 1167, 667, 2700, 2633, 200, 1267, 3667, 1467, 1100, 167, 700,…
## $ adjusted <dbl> 38.3, 38.4, 36.9, 36.6, 36.0, 36.3, 35.7, 35.5, 35.8, 36.5, 3…
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "daily",
type = "log",
col_rename = "daily_returns",
cols = c(nested.col))
#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
type = "arithmetic",
col_rename = "monthly_returns",
cols = c(nested.col))
#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "arithmetic",
col_rename = "yearly_returns",
cols = c(nested.col))SPY; min, max, median, mean, SD.# YOUR CODE GOES HERE
summary_monthly<-myStocks_returns_monthly%>%
group_by(symbol)%>%
summarise(min_return = min(monthly_returns), max_return = max(monthly_returns), median_return = median(monthly_returns), avg_return = mean(monthly_returns), sd_return = sd(monthly_returns))
summary_monthly## # A tibble: 7 × 6
## symbol min_return max_return median_return avg_return sd_return
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL -0.181 0.217 0.0257 0.0245 0.0785
## 2 AMC -0.504 5.25 0.00679 0.0798 0.611
## 3 COW -0.134 0.0762 0.00264 -0.00524 0.0485
## 4 GME -0.687 16.3 -0.0118 0.142 1.45
## 5 GREK -0.332 0.369 0.00609 0.00365 0.110
## 6 SPY -0.125 0.127 0.0174 0.0123 0.0381
## 7 TSLA -0.224 0.811 0.0148 0.0523 0.176
geom_density(), for each of the stocks# YOUR CODE GOES HERE
density<-ggplot(myStocks_returns_monthly, aes(x=monthly_returns))+
geom_density()+
facet_wrap(vars(symbol), scales="free")+ #we use free scales per density plot in order to enhance readability.
labs(title = "Return Distribution", subtitle = "Density plots on monthly returns", x = "Monthly return", y = "Percentage")
densityFrom the density plots, one can observe that volatility in monthly return varies widely across the stocks/ETF. For instance, the returns of COW seem to be more evenly distributed than GME, indicating a lower risk when investing into them. In our eyes, GME seems to be the riskiest stock, while an investment into GREK carries the least risk. The latter is understandable as the GREK ETF is diversified and holds shares of the largest and most liquid companies in Greece (source).
Looking at the x axis, one can also observe a large difference in the range of monthly returns across the instruments. Interestingly, most of the distribution lies around the 0 mark for all displayed tickers.
ggrepel::geom_text_repel() to label each stock# YOUR CODE GOES HERE
scatter_return_volatility <- ggplot(summary_monthly, aes(x=sd_return, y=avg_return, label=symbol))+
geom_point()+
geom_smooth(method = "lm", se = F)+ #in this case, we do not want a confidence interval around the regression line
geom_text_repel()+
labs(title = "Risk vs. Reward", subtitle = "A look at the relationship between the instruments' standard deviation and average monthly return", x = "Standard deviation", y = "Average monthly return")+
theme_bw()
scatter_return_volatilityGenerally, the market seems to reward higher risk with higher return. However, there are minor exceptions in our sample: GREK, indexing the Greek market, has a higher SD than SPY, yet its average monthly return is lower. A good approximation for comparing risk and reward across instruments is the blue linear regression line. Tickers above this line indicate a greater reward for a given risk compared to tickers below that line
hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)## Rows: 1,470
## Columns: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
head(hr_cleaned)## # A tibble: 6 × 19
## age attrition daily_rate department distance_from_ho… education gender
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 41 Yes 1102 Sales 1 College Female
## 2 49 No 279 Research & Dev… 8 Below Col… Male
## 3 37 Yes 1373 Research & Dev… 2 College Male
## 4 33 No 1392 Research & Dev… 3 Master Female
## 5 27 No 591 Research & Dev… 2 Below Col… Male
## 6 32 No 1005 Research & Dev… 2 College Male
## # … with 12 more variables: job_role <chr>, environment_satisfaction <chr>,
## # job_satisfaction <chr>, marital_status <chr>, monthly_income <dbl>,
## # num_companies_worked <dbl>, percent_salary_hike <dbl>,
## # performance_rating <chr>, total_working_years <dbl>,
## # work_life_balance <chr>, years_at_company <dbl>,
## # years_since_last_promotion <dbl>
skim(hr_cleaned)| Name | hr_cleaned |
| Number of rows | 1470 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| attrition | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| department | 0 | 1 | 5 | 22 | 0 | 3 | 0 |
| education | 0 | 1 | 6 | 13 | 0 | 5 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| job_role | 0 | 1 | 7 | 25 | 0 | 9 | 0 |
| environment_satisfaction | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| job_satisfaction | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| marital_status | 0 | 1 | 6 | 8 | 0 | 3 | 0 |
| performance_rating | 0 | 1 | 9 | 11 | 0 | 2 | 0 |
| work_life_balance | 0 | 1 | 3 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 36.92 | 9.14 | 18 | 30 | 36 | 43 | 60 | ▂▇▇▃▂ |
| daily_rate | 0 | 1 | 802.49 | 403.51 | 102 | 465 | 802 | 1157 | 1499 | ▇▇▇▇▇ |
| distance_from_home | 0 | 1 | 9.19 | 8.11 | 1 | 2 | 7 | 14 | 29 | ▇▅▂▂▂ |
| monthly_income | 0 | 1 | 6502.93 | 4707.96 | 1009 | 2911 | 4919 | 8379 | 19999 | ▇▅▂▁▂ |
| num_companies_worked | 0 | 1 | 2.69 | 2.50 | 0 | 1 | 2 | 4 | 9 | ▇▃▂▂▁ |
| percent_salary_hike | 0 | 1 | 15.21 | 3.66 | 11 | 12 | 14 | 18 | 25 | ▇▅▃▂▁ |
| total_working_years | 0 | 1 | 11.28 | 7.78 | 0 | 6 | 10 | 15 | 40 | ▇▇▂▁▁ |
| years_at_company | 0 | 1 | 7.01 | 6.13 | 0 | 3 | 5 | 9 | 40 | ▇▂▁▁▁ |
| years_since_last_promotion | 0 | 1 | 2.19 | 3.22 | 0 | 0 | 1 | 3 | 15 | ▇▁▁▁▁ |
The hr_cleaned dataset holds fictional information of a total of 1470 employees on variables such as the employees’ education, job specifics, performance and attrition. All in all, there are 9 numeric and 10 non-numeric variables
Given the wide range of information, one can try to make inferences about dependencies among variables. For example, it might be interesting to find out if a combination of variables might be a good predictor for an employee’s decision to leave the company (attrition = Yes).
attrition)number_attrition<-hr_cleaned%>%
group_by(attrition)%>%
summarize(number_count = count(attrition))%>%
mutate(perc = number_count/sum(number_count)) #we introduce a second column, namely percent, to represent the count as a percentage of the total
number_attrition## # A tibble: 2 × 3
## attrition number_count perc
## <chr> <int> <dbl>
## 1 No 1233 0.839
## 2 Yes 237 0.161
Looking at the sheer counts of YES and NO entries for the attrition coloumn, around 237 of all employees have left the company, while 1233 have been staying employed. This corresponds to around 16% of all recorded employees.
age, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?n<-length(hr_cleaned$age)
pic_data<-data.frame(cbind(c(hr_cleaned$age,hr_cleaned$monthly_income,
hr_cleaned$years_since_last_promotion),
c(rep("age",n),rep("monthly_income",n),rep("years_since_last_promotion",n))))
colnames(pic_data)<-c("Num","Attribute")
pic_data$Num<-as.integer(pic_data$Num)
distribution<-ggplot(pic_data, aes(x=Num))+
geom_histogram()+
facet_wrap(vars(Attribute), scale="free")+
labs(title = "Distribution across the dataset", subtitle = "Counts of values for age, monthly income and years since last promotion", x="",y="Frequency")+
theme_bw()
distribution job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of totalhr_satis <- hr_cleaned %>%
group_by(job_satisfaction) %>%
count(sort=TRUE) %>%
mutate(percent=n/1470)
hr_satis ## # A tibble: 4 × 3
## # Groups: job_satisfaction [4]
## job_satisfaction n percent
## <chr> <int> <dbl>
## 1 Very High 459 0.312
## 2 High 442 0.301
## 3 Low 289 0.197
## 4 Medium 280 0.190
work_life_satis <- hr_cleaned %>%
group_by(work_life_balance) %>%
count(sort=TRUE) %>%
mutate(percent=n/1470)
work_life_satis ## # A tibble: 4 × 3
## # Groups: work_life_balance [4]
## work_life_balance n percent
## <chr> <int> <dbl>
## 1 Better 893 0.607
## 2 Good 344 0.234
## 3 Best 153 0.104
## 4 Bad 80 0.0544
monthly_income and education? monthly_income and gender?plot1 <- ggplot(hr_cleaned, aes(x = fct_reorder(education, -monthly_income), y=monthly_income))+ #we include a negative "-" sign infront of monthly_income in the fct_reorder function, in order to sort the x axis based on decreasing y values
geom_boxplot()+
scale_y_continuous(labels = dollar)+
labs(title = "Education vs income", subtitle = "A look at an employee's academic background and monthly income", x = "Education", y = "Monthly income")+
theme_bw()
plot1 plot2 <-ggplot(hr_cleaned, aes(x = fct_reorder(gender, -monthly_income), y=monthly_income))+
geom_boxplot()+
scale_y_continuous(label = dollar)+
labs(title = "Gender vs income", subtitle = "A look at an employee's sex and monthly income", x = "Gender", y = "Monthly income")+
theme_bw()
plot2 income_vs_job <- ggplot(hr_cleaned, aes(x=fct_reorder(job_role, -monthly_income), y=monthly_income)) +
geom_boxplot()+
scale_y_continuous(labels = dollar)+
labs(title = "Income distribution by job", subtitle = "A look at the relationship between job titles and monthly income", x = "Job" , y="Monthly income")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) #we adjust the labels on the x axis such that they are vertically aligned
income_vs_job#We opted for plotting the MEDIAN monthly income, as it is typically less prone to outliers
median_income <- hr_cleaned%>%
group_by(education)%>%
summarise(median_monthly_income = median(monthly_income))%>%
ggplot(aes(x = fct_reorder(education, -median_monthly_income), y = median_monthly_income))+
geom_col(fill = "lightblue")+
scale_y_continuous(labels = dollar)+
labs(title = "Median income by education", subtitle = "A look at the relationship between academic background and median monthly income", x = "Education" , y="Median monthly income")+
theme_bw()
median_incomeggthemesplot7 <- hr_cleaned %>%
group_by(education) %>%
ggplot(aes(x=monthly_income))+
geom_histogram(fill = "lightblue")+
facet_wrap(vars(education), scales = "free_y")+
scale_x_continuous(labels = dollar)+
labs(title = "Income distribution by education", subtitle = "A look at the relationship between edcuation level and monthly income", x = "Monthly income" , y="Count")+
ggthemes::theme_stata()+
theme(plot.background = element_blank()) #we remove the blue background to make the desing more coherent
plot7 job_roleplot8 <- ggplot(hr_cleaned, aes(x=age, y=monthly_income))+
geom_point(size=0.1)+
geom_smooth(method = "lm")+
facet_wrap(vars(job_role), scales = "free")+
scale_y_continuous(labels = dollar)+
labs(title = "Income by age and job title", subtitle = "A look at the relationship between age, job title and monthly income level", x = "Age" , y="Monthly income")+
ggthemes::theme_stata()+
theme(plot.background = element_blank())
plot8 We would like to replicate the following chart.
First we do some cleaning. As the vaccination data is time series, but we are only interested in the most recent status of vacc. percent per county, we can limit our dataframe to Sep. 3rd data.
In addition, we add a column to the election dataset which calculates the number of votes for one candidate as a percentage of the total number of votes in every given county.
#as there are multiple dates for vacc statuses, and we are only interested in the most recent one, we can do some cleaning. Data from Sep. 3rd, 2021, seems to be most recent.
vaccinations_recent <- vaccinations %>%
filter(date=="09/03/2021")
#we can also include an additional coloumn to the election dataset, indicating the number of votes for one candidate as a percentage of whole
elections_perc <- election2020_results%>%
mutate(perc_votes = candidatevotes/totalvotes)Next up, we create a new dataframe, based on joined values from the three original ones. Here, we only include the data really necessary for the reproduction of the graph.
Let’s have a sneak peak at the output dataframe. It looks like this is all the data we really need.
#We decide to just join coloumn series_complete_pop_pct from the vacc dataset, as this should suffice to recreate the graph. We use inner join here, as we have a key variable, i.e. fips, to join all three datasets by.
#We also rename the coloumns to make their names a little more easy on the tongue.
joined_frame <- inner_join(population, vaccinations_recent[,c("fips", "series_complete_pop_pct") ], by = "fips")%>%
rename("vacc_perc" = "series_complete_pop_pct", "pop" = "pop_estimate_2019")
#Similarly, we join parts of the cleaned election data, and filter out records other than votes for Trump, as they are not needed for the graph. In addition, we sum up perc_votes because there is sometimes data for different ways of voting, i.e. early vote, election day, provisional.
joined_frame <- inner_join(joined_frame, elections_perc[,c("fips", "candidate", "perc_votes")])%>%
filter(candidate=="DONALD J TRUMP")%>%
group_by(fips, pop, vacc_perc, candidate, )%>%
summarise(sum(perc_votes))%>%
rename("perc_votes_total" = "sum(perc_votes)") #sum, to take into consideration all kinds of voting methods.
head(joined_frame) #we can have a look at the dataset. It looks like we have all the data we need for visualization## # A tibble: 6 × 5
## # Groups: fips, pop, vacc_perc [6]
## fips pop vacc_perc candidate perc_votes_total
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 01001 55869 30.5 DONALD J TRUMP 0.714
## 2 01003 223234 37.9 DONALD J TRUMP 0.762
## 3 01005 24686 31.3 DONALD J TRUMP 0.535
## 4 01007 22394 26.6 DONALD J TRUMP 0.784
## 5 01009 57826 22.6 DONALD J TRUMP 0.896
## 6 01011 10101 38.4 DONALD J TRUMP 0.248
Now that we have created the dataframe with all necessary data, let’s get to the attractive part of this exercise: making a beautiful plot.
plot_counties <- ggplot(joined_frame, aes(x=perc_votes_total, vacc_perc))+
geom_rect(aes(xmin = 0, xmax = 0.45, ymin = 0, ymax = 100), fill = "#A6A8FB")+ #we add three boxes to the plot that have HEX values for the blue, purple and red colours respectively.
geom_rect(aes(xmin = 0.45, xmax = 0.55, ymin = 0, ymax = 100), fill = "#D2A6D2")+
geom_rect(aes(xmin = 0.55, xmax = 1, ymin = 0, ymax = 100), fill = "#FDA7A7")+
geom_point(aes(size=pop), alpha = 1/3)+
geom_smooth(method = "lm")+
labs(title = "COVID19 vaccination levels versus Trump voting", subtitle = "A look at the relationship between 2020 US election results and percentages of fully vaccinated people, per county as of Sep. 3rd, 2021", x = "Trump voting", y = "County's percentage of fully vaccinated people")+
theme_bw()
plot_countiesIt seems like there are a number of outliers in the dataset. 276 counties have reported a full vaccination count of 0 for the Sep. 3rd collection. For instance, as you can see in the following output dataframe, Bexar county, home to a population of 2003554, reported a vaccination rate of 0%.
filter(joined_frame[ ,c("fips", "pop", "vacc_perc")], fips==48029)## # A tibble: 1 × 3
## # Groups: fips, pop, vacc_perc [1]
## fips pop vacc_perc
## <chr> <dbl> <dbl>
## 1 48029 2003554 0
It might be appropriate, to exclude records with vacc_perc = 0 from out dataset, as it seems very hard to believe that in a county like Bexar, home to 2003554 people, not a single one is fully vaccinated. A data collection error like this potentially has a large impact on the regression line. We therefore create an adjusted DF.
#we first exclude records with vacc_perc equal to 0
adj_joined_fram <- joined_frame%>%
filter(vacc_perc!=0)
#here, we simply replicate the plot, but this time with adjusted values.
adj_plot_counties <- ggplot(adj_joined_fram, aes(x = perc_votes_total, y = vacc_perc))+
geom_rect(aes(xmin = 0, xmax = 0.45, ymin = 0, ymax = 100), fill = "#A6A8FB")+
geom_rect(aes(xmin = 0.45, xmax = 0.55, ymin = 0, ymax = 100), fill = "#D2A6D2")+
geom_rect(aes(xmin = 0.55, xmax = 1, ymin = 0, ymax = 100), fill = "#FDA7A7")+
geom_point(aes(size = pop), alpha = 1/3)+
geom_smooth(method = "lm")+
labs(title = "ADJUSTED - COVID19 vaccination levels versus Trump voting", subtitle = "An adjusted look at the relationship between 2020 US election results and percentages of fully vaccinated people, per county as of Sep. 3rd, 2021", x = "Trump voting", y = "County's percentage of fully vaccinated people")+
theme_bw()
adj_plot_countiesurl <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# get tables that exist on wikipedia page
tables <- url %>%
read_html() %>%
html_nodes(css="table")
# parse HTML tables into a dataframe called polls
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>%
html_table(fill=TRUE)%>%
janitor::clean_names())
# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
slice(2:(n()-1)) %>% # drop the first row, as it contains again the variable names and last row that contains 2017 results
mutate(
# polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
# and we extract it by picking the last 11 characters from that field
end_date = str_sub(fieldwork_date, -11),
# end_date is still a string, so we convert it into a date object using lubridate::dmy()
end_date = dmy(end_date),
# we also get the month and week number from the date, if we want to do analysis by month- week, etc.
month = month(end_date),
week = isoweek(end_date)
)Now that we have fetched all the data from the wikipedia table into our german_election_polls data frame, let’s try to visualize our data.
plot_election_polls <- ggplot(german_election_polls, aes(x = end_date))+ #because we have multiple lines (representing multiple parties), we only map the x axis uptop here. We map y-values when we create every scatter and line plot separately for each party.
geom_point(aes(y = union), color = "black", alpha = 0.3)+
geom_line(aes(y=rollmean(union, 14, na.pad=TRUE), color = "black"))+ #we add a line representing the rolling average of values for the past 14 days. We also adjust the padding for enhanced visualization.+
geom_point(aes(y = spd), color = "red", alpha = 0.3)+
geom_line(aes(y=rollmean(spd, 14, na.pad=TRUE), color = "red"))+
geom_point(aes(y = grune), color = "darkgreen", alpha = 0.3)+
geom_line(aes(y=rollmean(grune, 14, na.pad=TRUE), color = "darkgreen"))+
geom_point(aes(y = af_d), color = "blue", alpha = 0.3)+
geom_line(aes(y=rollmean(af_d, 14, na.pad=TRUE), color = "blue"))+
geom_point(aes(y = fdp), color = "#F4E332", alpha = 0.3)+
geom_line(aes(y=rollmean(fdp, 14, na.pad=TRUE), color = "#F4E332"))+
geom_point(aes(y=linke), color = "darkred", alpha = 0.3)+
geom_line(aes(y=rollmean(linke, 14, na.pad = TRUE), color = "darkred"))+
#we have desperately tried to add a functioning legend to our plot. We finally managed to do so with the scale color identity function, after mapping the correct colors to the variable aesthetics before. Inspiration for this code came from https://aosmith.rbind.io/2018/07/19/manual-legends-ggplot2/
scale_color_identity(breaks = c("black", "red", "darkgreen", "blue", "#F4E332", "darkred"),
labels = c("CDU/CSU", "SPD", "Grüne", "AfD", "FDP", "Linke"),
guide = "legend",
name = "Legend")+
labs(title = "German elections", subtitle = "Change in estimated voting during 2021", x="2021 Poll Month", y="Voting Percentage")+
theme_bw()
plot_election_pollsCheck minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.